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Abstract 

Graph sampling via crawling has been actively considered as a generic and important tool 
for collecting uniform node samples so as to consistently estimate and uncover various character- 
istics of complex networks. The so-called simple random walk with re-weighting (SRW-rw) and 
Metropolis-Hastings (MH) algorithm have been popular in the literature for such unbiased graph 
sampling. However, an unavoidable downside of their core random walks - slow diffusion over 
the space, can cause poor estimation accuracy. In this paper, we propose non-backtracking ran- 
dom walk with re-weighting (NBRW-rw) and MH algorithm with delayed acceptance (MHDA) 
which are theoretically guaranteed to achieve, at almost no additional cost, not only unbiased 
graph sampling but also higher efficiency (smaller asymptotic variance of the resulting unbiased 
estimators) than the SRW-rw and the MH algorithm, respectively. In particular, a remark- 
able feature of the MHDA is its applicability for any non-uniform node sampling like the MH 
algorithm, but ensuring better sampling efficiency than the MH algorithm. We also provide 
simulation results to confirm our theoretical findings. 

Keywords: unbiased graph sampling, random walks, non-reversible Markov chains, semi-Markov 
chains, asymptotic variance 

1 Introduction 



Estimating various nodal and topological properties of complex networks such as online social 
networks (OSNs), peer-to-peer (P2P) networks, and the world wide web (WWW) has recently 
attracted much attention from research community because of their ever-increasing popularity and 
importance in our daily life. However, the estimation of network characteristics is a non-trivial task, 
as these networks are typically too large to measure, making a complete picture of the network hard 
to obtain and even its size unknown. It is thus infeasible to perform 'independence sampling' which 
obtains uniform node samples (for unbiased estimation) directly and independently from such a 
large, unknown network. Instead, graph crawling techniques - graph sampling via crawling, have 
been widely used for that purpose. In particular, random walk-based graph sampling methods (or 
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Markov chain samplers) have become popular, as they are simple and implementable in a distributed 
fashion and also able to provide unbiased graph sampling, unlike the breadth-first-search (BFS) 
and its variants leading to unknown bias p21 [20] . 

In the literature, the most popular random walk-based graph sampling methods are the so- 
called simple random walk with re-weighting (SRW-rw) [29} [T2] and Metropolis-Hastings (MH) 
algorithm [251 [T6| l34l l29j \12\ I15j. The former launches a simple random walk (SRW) over a graph 
Q, which moves from a node to one of its neighbors chosen uniformly at random, to collect random 
node samples, followed by a re-weighting process in order to eliminate the bias caused by the 
non-uniform stationary distribution of the SRW. The other method is to rely on a Metropolis- 
Hastings random walk (MHRW) crawling over Q — a random walk achieving a unform distribution 
constructed by the famous MH algorithm [25, 16J, to obtain uniform node samples. 

Motivation and Contributions: While the SRW-rw and MH algorithm ensure unbiased graph 
sampling, the core components - SRW and MHRW, suffer from their slow diffusion over the space, 
which can in turn lead to poor estimation accuracy. In particular, their fully random nature in 
selecting the next node, when making a transition, often cause them to go back to the previous node 
from where they just came. This produces many duplicate samples for a short to moderate time 
span, thereby reducing estimation accuracy. It is apparently desirable to avoid such backtracking 
transitions whenever possible, so as to steer them toward 'unvisited' places (or to obtain new node 
samples), as long as such a modification does not affect the unbiased estimation. 

However, it is still uncertain how to achieve this at almost no additional cost and whether it 
really results in better estimation accuracy. We provide affirmative answers for these questions. 
Specifically, we propose non-backtracking random walk with re-weighting (NBRW-rw) and MH 
algorithm with delayed acceptance (MHDA), and prove that each of them guarantees not only 
unbiased graph sampling but also higher efficiency (smaller asymptotic variance of the estimators) 
than the SRW-rw and the MH algorithm, respectively. A notable feature of our MHDA is its 
generic purpose: the MHDA is theoretically guaranteed to enhance the standard MH algorithm for 
constructing a random walk or a Markov chain with any arbitrarily given stationary distribution 
under the constraints of graph structure. Thus, the MHDA is applied, as 'a special case', to 
construct a random walk crawling over a graph Q achieving a uniform stationary distribution, 
leading to higher efficiency than the MHRW while ensuring the unbiased estimation. To the best of 
our knowledge, this is the first theoretical result to improve, with proven guarantee, both SRW-rw 
and the MH algorithm for unbiased graph sampling. 

Related Work: Very recently, there have been a few attempts to improve the estimation accu- 
racy against the SRW-rw (not the MH algorithm) through multiple dependent random walks [30 1, 
a random walk on a weighted graph (with a priori estimate of network information) [20J, and the 
addition of random jumps (to anywhere in the graph) [5]. The corresponding Markov chains are 
time-reversible, whereas the main kernel of our proposed methods is transforming 'any' reversible 
Markov chain to its related non-reversible chain which avoids backtracking transitions and also 
achieves the same stationary distribution. Thus, our work is complementary to their approaches. 

On the other hand, there is a body of research works across many disciplines for speeding 
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up a random walk, or Markov chain, on a graph Q in terms of its mixing time, hitting time, 
and/or cover time. The fastest mixing (reversible) Markov chain on a graph is obtained in [8] with 
complete knowledge of entire graph. [91 [TO] showed that certain 'lifted' (non-reversible) Markov 
chains converge to their stationary distributions faster than their related reversible chain, and 
[TOl 122] subsequently applied this idea to design a fast and efficient average consensus algorithm. 
It is, however, still unknown how to construct such a 'lifted' Markov chain in a distributed or 
decentralized manner for a general graph. 

[31 [71 [T7] recently undertook speeding up a SRW based only on local information, but did not 
provide any direct implication to the unbiased graph sampling. As the MH algorithm is the most 
popular method of Markov Chain Monte Carlo (MCMC) simulations or samplers, it has been an 
active research topic to improve the MH algorithm in terms of the sampler performance (asymptotic 
variance) in the MCMC literature (e.g., [261 114[ [35]). However, most works toward more efficient 
MCMC samplers (including [23 Q31 [35] ) do not take into account graph-topological constraints in 
that transition from node i to j ^ i is allowed only when they are neighbors of each other, and thus 
cannot be directly applicable to unbiased graph sampling. 

Organization: The rest of the paper is organized as follows. We first provide an in-depth 
overview on generic Markov chain samplers for unbiased graph sampling in Section 2, and then 
briefly review the SRW-rw and MH algorithm in Section 3. In Section 4, we present a general 
recipe for the transformation of a time-reversible Markov to its related non-reversible Markov 
chain, which forms a common building block for our proposed NBRW-rw and MHDA. We then 
explain the details of the NBRW-rw and MHDA, and provide relevant analysis. In Section 5, we 
provide simulation results obtained based on real graphs to support our theoretical findings. We 
finally conclude in Section 6. 

2 Background on Markov Chain Samplers 
2.1 Unbiased Graph Sampling 

Consider a connected, undirected graph Q = (J\f, £) with a set of nodes (vertices) J\f = {1,2, ... ,n} 
and a set of edges £. We assume that 3 < \M\ =n < oo. We also assume that the graph Q has 
no self-loops and no multi-edges. Let N(i) = {j G M : 6 £} be the set of neighbors of node 
i G J\f, and d(i) = |-/V(i)| be the degree of node i. 

Unbiased graph (or node) sampling, via crawling, is to consistently estimate nodal or topological 
properties of a target graph c£] (e.g., an overlay network or an OSN) based upon uniform node 
samples obtained by a random walk (or possibly multiple random walks) crawling over the graph 
Q. The goal here is to unbiasedly estimate a proportion of the nodes with a specific characteristic. 
Thus, the unbiased, uniform graph sampling is, in principle, developing a random walk-based 
"estimator" or a Markov chain sampler for the expectation of any given, desired function / with 

* A target graph for sampling may be time-varying due to node join/leave, which is beyond the scope of this paper. 
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respect to a uniform distribution, i.e., 

»«(/) = £ (1) 

where u = [-u(l), u(2), . . . , u(n)] = [1/n, 1/n, . . . , 1/n]. Note that a nodal (or topological) character- 
istic of interest can be specified by properly choosing a function /. For example, for a target graph 
Q, if one is interested in estimating its degree distribution (say, P{-Dg = d}, d = 1, 2, . . . , n— 1), then 
choose a function / such that f(i) = l{dU)=d\ f° r * e ^ i- e -j /(*) = 1 if = d, and /(i) = 
otherwise. 

We below review a basic Markov chain theory which serves as the mathematical foundation 
for unbiased graph sampling via a (Markovian) random walk crawling over a graph Q. Define a 
random walk or a finite discrete-time Markov chain {Xt EjV,t = 0, 1, • • •} on the nodes of the graph 
Q with its transition matrix P = {P{i, j)}i ijeAf i n which 

P{i,j)=¥{X t+l =j\X t =i}, i,j G M, 

and P(i, j) = 1 for all i. Each edge G £ is associated with a transition probability 

P(i,j) > with which the chain (or random walk) makes a transition from node i to node j. We 
allow the chain to include self-transitions, i.e., P(i,i) > for some i, although Q has no self-loops. 
Clearly, P(i,j) = for all £ (i ^ j)- We then assume that the Markov chain {Xt} is 

irreducible, i.e., every node in M is reachable in finite time with positive probability, such that the 
chain has a unique stationary distribution 7r = [vr(l), 7r(2), . . . ,7r(n)]. 
For any function / : M— >-R, define an estimator 

k(f) = \jZf( X s) (2) 

1 s=l 

for the expectation of the function / with respect to it which is given by 

M/) = E ( 3 ) 

ieAf 

Then, the following Strong Law of Large Numbers (SLLN) (a.k.a., ergodic theorem) has been a 
fundamental basis for most of the random walk-based graph sampling methods in the literature [341 
[291 [HI QjS EDI U I2Q] , and more generally, MCMC samplers [281 H3 QH ED [27] . 

Theorem 1 jllty, \31^ Suppose that {Xt} is a finite, irreducible Markov chain with its stationary 
distribution n. Then, for any initial distribution F{Xq = i}, i G M, as t—too, 

P-t(f) ~^ ^n(f) almost surely (a.s.) 

for any function f with E^d/j) < oo. □ 

The SLLN ensures that the estimator jlt{f) based on any finite, irreducible Markov chain with 
the same 7r can serve as a valid and unbiased approximation of E^/). In particular, the two popular 
random walk-based graph sampling methods (or two different Markov chain samplers for unbiased 
graph sampling) in the networking literature - SRW-rw |29[ [12] and MH algorithm [34 [ [29l [T2"l [T5] 
are built upon the SLLN to asymptotically guarantee the unbiasedness of their estimators for E u (/). 
We will review in detail these two graph sampling methods in Section El 
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2.2 Central Limit Theorem and Asymptotic Variance 

For a given graph Q, there are potentially many (finite) irreducible Markov chains (or different 
random walks) preserving the same stationary distribution 7r, all of which can be used to obtain 
asymptotically unbiased estimates of E 7r (/), and also of E u (/), together with proper re- weighting 
if 7T ^ u. One important question would then be how to compare these 'competing' Markov chains, 
or rather, which one is 'better' or more efficient than the others as a Markov chain sampler for 
unbiased graph sampling. 

Mixing time can perhaps be a criterion to compare several irreducible, aperiodic Markov chains, 
all with the same stationary distribution. The mixing time captures the notion of the speed of con- 
vergence to the stationary distribution, and is typically defined via the total variation distance: for 
an irreducible, aperiodic Markov chain {Xt} with its transition matrix P and stationary distribution 
7T, the mixing time can be written as 

tmix(e) = min{t > 1 : max \\P\i, •) - tt\\tv < e}, 

ieJV 

where the total variation distance is defined by ||P*(z, •) — 7t||tv — ^sx-ACAf \P l (i, A) — tta\- Here, 
P t (i,A) denotes the t-step transition probability from node (state) i to subset AQj\F, and tta = 
YljeA 77 ^)- T ne mixing time has been actively studied in the literature, especially for irreducible, 
aperiodic, time-reversible^ Markov chains (e.g., [U EI]). In particular, it is now well known that 
the mixing time of such a Markov chain (or the asymptotic rate of convergence to its stationary 
distribution) is mainly governed by the second largest eigenvalue modulus (SLEM) - the second 
largest eigenvalue in absolute value, of its transition matrix, and smaller SLEM leads to smaller 
(faster) mixing time [U [21] . 

If the speed of convergence to the stationary distribution is a primary concern, then the mix- 
ing time is surely the right metric to compare different Markov chains with the same stationary 
distribution. However, this is not the case for the unbiased graph sampling. Random walk-based 
graph sampling methods typically adopt an initial burn-in period over which (initial) sampled val- 
ues are discarded to get rid of the dependence on the initial position of a random walk |12j . After 
such a burn-in period, the Markov chain (or random walk) will be close to its stationary regime 
(well mixed), but many samples are still yet to be obtained from this point onward. Therefore, the 
primary concern should be, instead, the efficiency of the estimator fit(f) in deciding how many ran- 
dom samples are required to achieve a certain accuracy of ftt(f) in regard to E,r(/) (and eventually 
to E u (/) after proper re- weighting if necessary). 

To that end, we define by cr 2 (f) the asymptotic variance of the estimator fit(f) based on an 
irreducible Markov chain {Xt} with its stationary distribution 7r, which is given by 

a 2 (f) 4 lim t ■ Var (&(/)) = lim \ E 

for any function / with E,r(/ 2 ) < oo, where the initial position (state) X$ is drawn from the 
stationary distribution 7r, i.e., Xq ~ 7r. Note that the asymptotic variance <J 2 (f) is, in fact, 

^ If the Markov chain {Xt} satisfies the reversibility condition (or detailed balance equation), i.e., n(i)P(i,j) = 
n(j)P(j,i) for all i,j, then the chain is called time-reversible. 
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independent of the distribution of the initial state Xq \28\ [31] . We below explain how effective the 
asymptotic variance o~ 2 (f) can be, through its connection to the Central Limit Theorem (CLT), in 
measuring the performance of the estimator /tt(/). 

Suppose first that the random samples X\,X2, . . . ,Xt are i.i.d. and drawn directly from it. 
Then, the standard Central Limit Theorem (CLT) says that 

Vt-MD-^Af)} =^ N(0,a 2 (/)), ast^oo, 

where => denotes convergence in distribution and N(0,<7 2 (/)) is a Gaussian random variable with 
zero mean and variance o~ 2 (f) =Var(/(Xi)). That is, the distribution of At(/) is asymptotically 
normal. For sufficiently large t, we also have 

which allows us to identify an approximate confidence interval for E w (/). For instance, for suffi- 
ciently large t, we can be 95% confident that E^/) is approximately between /tt(/) — 2(cr(/)/vi) 
and /tt(/) + 2(a(f)/y/i). This clearly demonstrates the importance of the asymptotic variance 
<j 2 (f) in conjunction with the CLT in assessing the accuracy of the estimator fit(f)- 

Not only for the above case with i.i.d. samples, the CLT holds also for Markov chains, as given 
below. 

Theorem 2 Jlffi For a finite, irreducible Markov chain {X t } with its stationary distribution 

Vt-[jlt(f)-^{f)} =^ N(0,a 2 (/)), ast^oo, 
for any function f with K n (f 2 ) < oo regardless of any initial distribution, and o~ 2 (f) is given by 
&■ □ 

Note that Theorems [TH2] (SLLN and CLT) do not require any assumption of aperiodicity |31j . 
However, for simplicity, we do not consider periodic Markov chains in our analysis throughout the 
paper. In addition, we focus on bounded functions / (and thus E,r(/ 2 ) < oo), which is typical in 
graph sampling applications. 

As shown above for i.i.d. samples, the CLT allows one to evaluate the asymptotic variance 
o~ 2 (f) in order to decide approximately how many (correlated) samples are required to achieve 
a certain accuracy of the estimator fit(f)- Hence, the asymptotic variance o~ 2 (f) has been an 
important criterion to rank the efficiency among competing Markov chains with the same 7r for 
the MCMC samplers [281 [261 [351 E3 EH [27], although quantifying a 2 (f) may not be easy. In 
particular, by noting that the asymptotic variance is independent of any initial distribution for 
which the CLT holds, the efficiency ordering over competing Markov chains with the same 7r (the 
smaller the asymptotic variance, the better the estimator performance) is still in effect even when 
the competing Markov chains are already in their stationary regimes (already 'mixed'). Observe 
that from Xq ~ it, Xt ~ 7r for all t (the chain {Xt} is in the stationary regime), and thus 
becomes 

oo 

a 2 (f) = Var(/(X )) + 2 £ Cov(/(X ), f(X k )), (5) 

k=i 
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where Cov(f(Xo),f(Xk)) = E{f(Xo)f(Xk)} — El^(/) denotes the covariance between /(Xq) and 
f(Xk). That is, even if the competing Markov chains are already in their stationary regimes, the 
correlation structure over random samples given by each of these Markov chains can vary and 
significantly affect their asymptotic variances. Observe that reducing the temporal correlation 
over random samples can lead to smaller asymptotic variances. This intuition can be leveraged to 
improve the existing Markov chain samplers for unbiased graph sampling. 

Motivated by the effectiveness of the asymptotic variance with its connection to the CLT, in 
this paper, we consider the asymptotic variance as a primary performance metric, and develop two 
random walk-based graph sampling methods, each of which guarantees the unbiased graph sampling 
with smaller asymptotic variance than its corresponding counterpart in the current networking 
literature. Before going into details, we next briefly review the existing two random walk-based 
graph sampling methods. 



3 Random Walk-based Graph Sampling 

3.1 Simple Random Walk with Re- weighting 

We first review the SRW-rw, a.k.a., respondent-driven sampling [33], which has been recently 
used in [29j H] f° r unbiased graph sampling. This method operates based upon a sequence of 
(correlated) random samples obtained by a SRW, together with a proper re-weighting process to 
ensure the unbiased sampling. It is essentially a special case of the importance sampling (a Monte 
Carlo method) applied for random samples generated by a Markov chain [61 [2"3"1 113] . While there 
are similar variants of such method (e.g., [21]), the main idea behind them is still to correct the 
sampling bias caused by the stationary distribution of the SRW. 

Consider a SRW on Q that moves from a node to one of its neighbors chosen uniformly at 
random (u.a.r.). Specifically, let {Xt} be the Markov chain representing the sequence of visited 
nodes by the SRW, with its transition matrix P = {P(i,j)}ij£M given by 

f^v If (i,j)e 8, 

PM = P 1 ' (6) 

I otherwise. 

It is well known that P is irreducible, and reversible with respect to a unique stationary distribution 
7T for which vr(i) = d(i)/(2\8\),i £ Af [2]. 

Suppose that there are t random samples {X s }* =1 from the SRW. Then, for a function of 
interest /, choose a weight function w : N — >M. such that 

«(») 2 \ £ \ 1 ■ A f 

TT(i) n d(i) 



Observe that from the SLLN in Theorem [IJ as t -> oo, 

t 

I 



1 * 

At(W0 = 7l>(^)/PQ -> M«>/)=Eu(/) a.s. 
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and thus the estimator fit(wf) is unbiased for E u (/). However, this estimator itself is not practical, 
since n and \£\ are typically unknown a priori. Instead, another estimator jlt{w f) / p,t{w) is often 
used as an unbiased estimator for E u (/). Indeed, Theorem [T] asserts that fit(wf) and fk(w) converge 
to E u (/) and 1 almost surely, as t — > oo, respectively. This yields 

£<(») «>(*.) y " 

Hence, the estimator fit(wf) / fit(w) can be made in such a way that we need to know w{%) only up to 
a multiplicative constant. That is, if we set w(i) = l/d(i), i^N, then the estimator p^(wf)/p^(w) 
remains intact, and is more practical as an unbiased estimator for K u (f). Throughout this paper, 
we refer to the estimator fit{w /) / jhiw) with w{i) = l/d{i) (iEj\F) as the unbiased estimator for 
E u (/) in the SRW-rw [221 [12]. 

As an example, for a target graph Q, choose a function / such that f(i) = lid(i)=d} for i € M 
in order to estimate the degree distribution f{Dg = d}. Then, for any given d, 

ik(wf) E*=i H^x^/diXs) 1 



IH(w) Y? s =xVd{X s ) . eN 



implying that the estimator fit{w f) / fit(w) yields a valid unbiased estimate of ^{Dg = d}. 
3.2 Metropolis-Hastings Algorithm 

The MH algorithm [25\ [To] was developed to construct a transition matrix P of a time-reversible 
Markov chain {Xt} with a given, desired stationary distribution tv. Here, we only discuss the MH 
algorithm under the topological constraints of a graph Q in that transition from node i to j ^ i is 
allowed only when they are neighbors of each other. The MH algorithm is defined as follows. At the 
current state i of X t , the next state X t +i is proposed with a proposal probability Q(i,j), which is 
a state transition probability of an arbitrary irreducible Markov chain on the state space Af, where 
Q(i,j) > if and only if Q(j,i) > 0, and Q(i,j) = for all £ (i^j). Let Q = {Q(i, j)}i,jeAf 
be a proposal (transition) matrix. The proposed state transition to Xt+i = j is accepted with an 
acceptance probability 

A{i,j) = mini 1, ) , (7) 



n(i)Q(i,j) 

and rejected with probability 1 — A(i,j) in which case Xt+\ =i. Thus, the transition probability 
P(i,j) becomes, for i^j, 

P(iJ) = Q(i,j)A(i,j) = min I Q(i,j),Q(j,i)^\, (8) 

with P(i,i) = i—^2j^P(i,j), which ensures that P is reversible with respect to n. Note that the 
uniqueness of tv is granted due to the irreducibility of Q (so is P) and the finite state space. 

The MH algorithm, in addition to its popular applications for MCMC simulation, has been 
also widely used as a means for unbiased graph sampling [3H [291 (El ES]. Specifically, the MH 
algorithm has been applied to construct a MHRW on Q achieving a uniform stationary distribution, 
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i.e., 7v = u. This is done with transition probabilities of a SRW as the proposal probabilities, i.e., 
Q(i,j) = l/d(i) if £ £ and Q(i,j) = 0, otherwise. The resulting transition probability of the 
MHRW on Q becomes 



and P[i,i) = ^-~^2j^iP(hj)- Thus, P is reversible with respect to iv = u, implying that for any 
function /, the estimator fit(f) based upon random samples by the MHRW is unbiased for E u (/). 
This version of MH algorithm is summarized in Algorithm [TJ where Xt £ N denotes the location 
of the MHRW at time t, and d(Xt) denotes the degree of node Xt- Here, Xq can be arbitrarily 
chosen. 

Algorithm 1 MH algorithm for MHRW (at time t) 
1: Choose node j u.a.r. from neighbors of Xt, i.e., N(Xt) 
2: Generate p ~ U(0, 1) 



4: X t+ 1 <r- j 

5: else 

6: X t+ i <— Xt 
7: end if 



Remark 1 It is worth noting that the MH algorithm (Algorithm^ does not need to know the self- 
transition probabilities P(i,i) explicitly, nor does it require all the neighbors' degree information of 
the current node Xt at each time t. Instead, only the degree information of the randomly chosen 
neighbor j is enough for making decision whether or not to move to j. 

Recall that the above unbiased estimators are based on t random, consecutive samples obtained 
under SRW or MHRW, respectively. Observe that the SRW, currently at node i at time s can 
'backtrack' to the previously visited node with probability l/d(i), i.e., X s+ \ = A s _i, trapping the 
SRW temporarily in a local region. The situation can be worse for the regions in which nodes 
have small degrees (so higher chance of backtracking). Similarly, the MHRW at node i can also 
backtrack to the previously visited node after staying at node i for some random time. This 
slow 'diffusion' of SRW/MHRW over the space can, in turn, lead to highly duplicated random 
samples for a short to moderate time duration, thereby increasing the variance of the unbiased 
estimators. Recall that the asymptotic variance in Q involves covariance terms Cov(/(Ao), f{Xj t )). 
Thus, it would be beneficial for both SRW and MHRW (or precisely, their variants) to avoid 
backtracking to the previously visited node up to the extent possible in order to reduce the temporal 
correlation over random consecutive samples, while maintaining the same stationary distribution so 
that the aforementioned mathematical framework for the unbiased estimation remains intact. Thus 
motivated, for the rest of this paper, we investigate how to achieve this at almost no additional 
cost, and rigorously prove that our proposed sampling methods give smaller (no worse) asymptotic 
variance than the SRW (with re-weighting) and MHRW-based ones, respectively. 
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4 Avoid Backtracking To Previously Visited Node 



In this section, we propose two random walk-based graph sampling methods - (i) non- backtracking 
random walk with re- weighting and (ii) MH algorithm with delayed acceptance, each of 
which theoretically guarantees unbiased graph sampling with smaller asymptotic variance than the 
SRW-rw and the (original) MH algorithm, respectively. In particular, our proposed sampling meth- 
ods require almost no additional cost, or more precisely, just remembering where the underlying 
random walk came from, when compared to the conventional methods. The reasoning behind the 
improvement of asymptotic variance is to modify each of SRW and MHRW, when making a transi- 
tion from the current node to one of its neighbors, to reduce bias toward the previous state (one of 
the neighbors of the current node), while maintaining the same stationary distribution. Note that 
such directional bias breaks the time-reversibility of the SRW and MHRW. Thus, a common build- 
ing block for our proposed sampling methods will be, for a given reversible Markov chain with its 
stationary distribution 7r, to construct a non-reversible Markov chain preserving the same n while 
avoiding (to the extent possible) transitions that backtrack to the state from which the chain just 
came. Our challenge here is to construct such a non-reversible chain with only one-state memory 
and theoretical guarantee for higher efficiency (smaller asymptotic variance). In what follows, we 
first explain a basic setup for this transformation and several relevant issues, and then present the 
details of our proposed methods. 

4.1 Prom Reversible To Non-reversible Chains 

Consider a generic random walk on Q, or a finite, irreducible, time-reversible Markov chain {Xt € 
J\f,t = 0,1,...}, with its transition matrix P = {P(i, j)}ijeN an d stationary distribution 7r = 
[ir(i),i £ A/"]. Our goal here is to construct its related new random walk or a finite, irreducible, 
non-reversible Markov chain with the same 7r which avoids backtracking to the previously visited 
node, which in turn produces a smaller asymptotic variance than the original reversible chain. An 
important requirement is that this transformation should be done at no additional cost and in a 
distributed manner. It is worth noting that there have been other works [9], [10] showing that certain 
non-reversible Markov chains or lifted Markov chains mix substantially faster than their related 
reversible chains. While this concept has been also applied to design a fast and efficient average 
consensus algorithm [19[ [22] , it is still unknown how to construct such a non-reversible chain or 
lifted Markov chain in a fully distributed or decentralized fashion, not to mention how to do so for 
any arbitrarily given target stationary distribution n. 

A general recipe for constructing a non-reversible Markov chain in an augmented state 
space: Let X[ G M, t = Q, 1, 2. . ., be the location of a new random walk at time t. At the current 
node X' t , the next node X' t+1 is decided based upon not only the current node X[ but also the 
previous node X[_ 1 so as to avoid backtracking. Due to the dependency (memory) to the previous 
node, {Xt} t > itself cannot be a Markov chain on the state space J\f, regardless of the choice of 
transition matrix. This walk, however, can still be made Markovian on an augmented state space 
instead, defined by 

M = {(i,j) : i,j eAfs.t. P(i,j) > 0} C AfxAf (10) 
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with < oo, and Z[ = (X^_ 1 ,X^) G Q for t > 1. For notational simplicity, let denote state 
GO. Note that ey 7^ ejj. It is also possible that en £ ^ for some i. A similar interpretation 
of a weighted random walk (or a reversible Markov chain) on the augmented state space can be 
also found in [2 t Ch.3], although its purpose is not for the construction of a related non-reversible 
chain. 

Let P' = {P'(eij,eik)} eij ,e lk <=n be the transition matrix of an irreducible Markov chain \Z[ £ 
f2, t = 1, 2, . . .} on the state space f2. Here, by definition, P'(eij, eik) = for all j 7^ I. If the unique 
stationary distribution tt' = [it' (e^) , £ fi] of the chain {Zj.} is given by 

7r'(ey) = Tr(i)P(i,j), £ O, (11) 

implying that it 1 {eij)=ir l [eu) from the reversibility of the original chain {X}, then the probability 
of the new random walk {Xj-} being at node j in the steady-state is the same as for all j (the 
stationary distribution of the original reversible chain {X^}). To see this, note that 

E 7r '( e 'i) = E "WP&j) = *U)> ^ G m> ( 12 ) 

where the first equality follows from P(u, v)=0, V(u, v) £" 0. In particular, for any original function 
of interest / : M—>M., choose another function g : $7— >M. such that g{eij) = f(j), and observe 

iMs) = E g{eij)A^) = E E fUMi)P(iJ) = E fUXJ) = M/)- 

Then, the SLLN in Theorem [T] gives 

^E^) = ^E/W ~ ► Hvfo)=M/) a.s., (13) 

i.e., ^2 l s= i g(Z' s )/t is a valid unbiased estimator for E.„.(/). We thus define, for any given function 
f :M^R, 

AU/) = 7E/M ( 14 ) 

1 3 = 1 

to be clearly distinguished from fat(f) in CO) defined based on the original chain {X}, while /&*(/) 
and fit(f) are both unbiased estimators for ~E„(f). In addition, the CLT in Theorem [2] implies 

t 



Vt- 



- t Y,9{Z' s )-^(g) 



Vt-m)-^(f)] =^ N(0,/(/)), (15) 



where cr' 2 (f) denotes the asymptotic variance of At(/) ( an d also of ^* =1 #(^s)A)- Throughout the 
paper, we use the prime symbol (') for any notation related to a newly defined process (e.g., {X|}) 
to differentiate it from its counterpart defined on the original process (e.g., {Xt}). 

While there are infinitely many different transition matrices P' leading to the unbiased esti- 
mator fl' t {f) for E 7r (/), our primary goal is, at (almost) no additional cost and in a distributed 
manner, to find a transition matrix P' that also guarantees smaller asymptotic variance. Under a 
rather restricted setting, R. Neal gave a partial answer to this in [27] saying that less backtrack- 
ing (rendering the resulting Markov chain {Z' t } non-reversible) can result in a smaller asymptotic 
variance. We restate his finding below. 
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Theorem 3 \27[ Theorem 2] Suppose that {X{\ is an irreducible, reversible Markov chain on the 
state space M with transition matrix P = {P(i,j)} and stationary distribution ir. Construct a 
Markov chain {Zj.} on the state space with transition matrix P' = {P'(eij, eik)} in which the 
transition probabilities P'(eij,eik) satisfy the following two conditions: for all eij,Cji,ejk,ekj £ ^ 
with i 7^ k, 

P(j, i)P'(e ij} e jk ) = P(j, k)P'(e kj , e*), (16) 
P'ieij^j^^P^k). (17) 

Then, the Markov chain {Z^} is irreducible and non-reversible with a unique stationary distribution 
ir' in which ir'{eij) = Tr{i)P{i,j), eij G f2. Also, for any function f, the asymptotic variance of 
fj! t {f) is no greater than that of ftt(f), i-&-, °~' 2 {f) < °" 2 (/)- ^ 

Remark 2 The condition in i!6\) ensures that the resulting transition matrix P' is stationary with 
respect to n' in ill]) and, in turn, leads to the unbiased estimator (i' t (f) for E 7r (/). Together with 



this condition, the condition in (H\) - less backtracking to the previously visited node, brings out 



the improvement of asymptotic variance. 

Theorem is quite versatile and provides a guideline on how to choose the transition matrix P' 
of a Markov chain {Z^} leading to smaller asymptotic variance, and thus will play an essential role 
in developing our graph sampling methods and subsequent analysis. Despite this large degree of 
freedom, it is still uncertain how to choose such a transition matrix P' at no additional cost. While 
R. Neal suggested a procedure to find P', it generally poses significant cost, especially for improving 
the MH algorithm, as admitted in [27]. (See pp. 9-10 therein.) Recall that the MH algorithm 
(Algorithm [1]) for MHRW only needs the degree information of a randomly chosen neighbor j of 
the current node Xt to decide whether or not to move j, as mentioned in Remark [T^l However, 
the procedure by R. Neal necessitates the explicit knowledge of all P(i,i) 7 s [27]. That is, the 
corresponding modified MHRW would require all the neighbors' degree information of the current 
node X[ at each time t in order to choose the next node X' t+l . Imagine such modified MHRW 
crawling over an OSN (say, Facebook) and located at a certain user's page. To simply decide 
where to go, the walk would have to visit all his/her friends' pages first and collect all their degree 
information (i.e., the number of friends) before making decision to move. This is clearly impractical 
for our graph sampling purpose. Therefore, in this paper, we set out to develop our own graph 
samplers with higher efficiency without any such overhead, by leveraging Theorem [3] as a building 
block. 



4.2 Non-backtracking Random Walk with Re-weighting 

We first introduce non-backtracking random walk with re-weighting (NBRW-rw) that ensures un- 
biased graph sampling, and then prove that NBRW-rw guarantees a smaller asymptotic variance 

*More generally, in the MH algorithm with any proposal matrix Q, it is often unnecessary to know self-transition 
probabilities P(i, i) explicitly, or does not require summing the probabilities of rejection for all possible proposals 
just to compute P(i, i) = Q{i, Q(h j))- 
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Figure 1: Illustrating the transitions of an NBRW over the nodes of Q. The walker is currently 
located at node j (with d(j) = 4) and just came from node i. From j, it will move to one of its 
neighbors except node i with equal probability. 



than SRW-rw. The non-backtracking random walk (NBRW) is a discrete-time random walk which 
'never' backtracks (thus named non-backtracking) to the previous node (whenever possible) while 
preserving the same stationary distribution as that of a SRW. Thus, the proposed sampling method 
is to use an NBRW, instead of a SRW, to collect a sequence of samples by crawling over a target 
Q, and at the same time, to employ the same re-weighting process as is done for SRW in order to 
eliminate sampling bias induced from its non-uniform stationary distribution. 

Consider an irreducible, reversible Markov chain {X t } t >o (a sequence of visited nodes) by the 
SRW with its transition matrix P = {P(i, given by ©, and stationary distribution ir(i) = 

d{i)/(2\£\), i£j\f. Then, the NBRW is defined as follows. A (discrete-time) random walk at the 
current node j with d(j) > 2 moves to the next node k, chosen u.a.r. from the neighbors of node j 
except the previous node i. If the current node j has only one neighbor (d(j) = 1), the walk always 
returns to the previous node i. Figure Q] depicts this non-backtracking nature of the NBRW in its 
transitions over the nodes of Q. Here, an initial position of the NBRW can be arbitrarily chosen. 
The NBRW initially moves from the initial position to one of its neighbors with equal probability 
due to the absence of its 'previous node', and then proceeds as defined above thereafter. 

Let X[ £ J\f, t = 0, 1, 2, . . ., be the location of an NBRW. As before, we construct a Markov 
chain {Z' t = (X' t _ x , X[)} t >i with its transition matrix P' = {P' (eij , eik)} eij ,e ik &n given by, for all 
, ejk £ ^ with i ^ k (d(j) > 2), 

implying that P'(eij,eji) = 0. Also, P'(eij,eji) = 1 for any j with d(j) = 1. All other elements of 
P' are zero. Clearly, P' satisfies the conditions in (I16p - (ll7p . From Theorem El the Markov chain 
{Zj-} is irreducible and non-reversible with a unique stationary distribution 

7r'(e ) = n(i)P(i,j) = — - , eij G n. (19) 

That is, the probability of the NBRW being at node j in the steady-state is the same as vr(j'). 
See (|12p . From (|13p -(|14 p and Theorem [3j we also know that for any given function / of interest, 
fi' t {f) and fit(f) are both unbiased estimators for E,r(/), and the asymptotic variance of fi' t {f) 
(based on the random samples by the NBRW) is no larger than that of fit{f) (by the SRW), i.e., 
e' 2 (f)<° 2 U). 
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However, both unbiased estimators fi' t (f) and jlt{f) are for E 7r (/), not K u (f). It is unclear 
whether such improvement for the asymptotic variance remains true even after a proper re- weighting 
to obtain unbiased samples. As explained in Section 13. 1} the SRW-rw is to use the estimator 
p>t(wf)/ fit(w) with w(i) = l/d(i) (i£j\f) in order to consistently estimate E u (/). Since the station- 
ary distribution of the NBRW remains the same as that of the SRW, we can also use the estimator 
fi't(w f) I ' fi't{w) with the same weight function w, as a valid approximation of E u (/). Let cr^j(f) 
and o"'\/v(/) denote the asymptotic variances of the estimators jlt{wf) / fit{w) and £i' t (wf)/fi' t (w), 
respectively. To proceed, we need the following. 



Theorem 4 (Slutsky's theorem) JJ, pp.332] 

Let {At} and {B{\ be the sequences of random variables. If At 

probability to a non-zero constant b, then At/Bt ==^ A/b. 

Now we state our main result. 



A, and Bt converges in 



□ 



Theorem 5 For any function f : N — > R, the asymptotic variance of fi' t {w f) / jl' t {w) is no larger 
than that of £tt(wf)/ fit{w), i.e., < o~wif), where the weight function w is given by w{i) = 

l/d(i), i G M. □ 



Proof: Since the estimator £it(wf)/ fit(w) remains invariant up to a constant multiple of w, 
without loss of generality, we can set w(i) = u{i)/n{i) = 2\£\/(nd(i)). For any given /, observe 
that 



Vt 



fkjwf) 

At(«0 



Vt 



E s =iMx.s)f(x s 



E„(/) 



-Vt 



El=Mx s )(f(x s )-E u (f)) 



(20) 



Define another function h : M — > K such that 

h(i)±w(i)(f(i)-E u (f)), i G , 
implying M„(h) = YlieAf = 0. Then, from Theorems [T] and [H we have, as t — > oo, 

1 * [i ' 

- w(X s ) — > 1 a.s., and V - h(X s 



s=l 



N(0,a^)). 



Since almost sure convergence implies convergence in probability [4J, by Slutsky's theorem, from 
(pOD . we have 



V 



k{wf) 
h{w) 



E u (/) 



N(0,cr 2 (/i)), as t^oo. 



Together with (|13p and (|15p . following the same lines above, we similarly have 



V 



Eu(/) 



=4 N(0,(j ,2 (/i)), asi^oo. 
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Hence, for a given /, the asymptotic variance of the estimator £tt(w f) / fct(w) is nothing but cr^(/) = 
a 2 (h). Similarly, o~ ,2 y{f) = a' 2 (h). Therefore, since Theorem [3] says that for any function /, 
o~' 2 (f) < °~ 2 {f)-, we a ls° have cr'w(f) — °w(/)- That is, the asymptotic variance of fi' t (w f) / (i' t {w) 
is no larger than that of flt(wf)/fit(w). □ 

Remark 3 7n J3y, t/ie NBRW was originally considered for regular graphs with d(i) = d > 3, Vi G 
A/", and shown to lead to faster mixing rate (i.e., faster rate of convergence to its stationary distri- 
bution) than that of the SRW. In contrast, for any general (connected, undirected, not necessarily 
regular) graph Q , we show that the NBRW-rw ensures not only the unbiased graph sampling but 
also smaller asymptotic variance than the SRW-rw. 

4.3 Metropolis-Hastings Algorithm with Delayed Acceptance 

We turn our attention to improving the MH algorithm. For any given, desired stationary dis- 
tribution 7r, we propose Metropolis-Hastings algorithm with delayed acceptance (MHDA), which 
theoretically guarantees smaller asymptotic variance than the (generic) MH algorithm with pro- 
posal matrix Q that constructs a reversible Markov chain with arbitrary tt. In particular, we 
demonstrate that MHDA can be applied, as a special case, to construct a (non-Markovian) random 
walk on a graph Q which not only achieves a uniform stationary distribution tt = u for unbiased 
graph sampling, but leads to higher efficiency than MHRW by the MH algorithm (Algorithm [1]) . 
We emphasize that the only additional overhead here is remembering the previously visited node 
(one of the neighbors of the current node) from which the random walk came. 

Interpreting a reversible MH Markov chain as a semi-Markov chain: Consider an irre- 
ducible, reversible Markov chain {Xt G M}t>o by the MH algorithm with its transition matrix 
ijeAT given by ([5]), and any arbitrarily given target stationary distribution tt. Recall 
that the MH algorithm is nothing but a repetition of proposing a state transition with proposal 
probability Q(i,j) that is then accepted with an acceptance probability A(i,j) in ([7]) or rejected 
otherwise. Observe that the process {Af}, after entering into state (node) i, stays at state i 
for a geometrically distributed time duration with mean 1/(1 — P(i,i)), and then moves to an- 
other state j G N(i). Formally, define a Markov chain {X m G A/"} m >o with its transition matrix 
P-{P{hj)}i,jeAf given by, for j ^ i, 

Pc ■) - P&j) _ QihfiMhj) _ min {Q(^j) 1 Q(j^) 7r (i)/ 7r (^)} / 21 n 
{hJ) i-p(m) Ej^Q(i,j)A(hj) Zfrin&m,3),QV,i)*V)MQV [ ' 

with P(i,i) = 0. It is not difficult to see that the chain {X m } is irreducible, and reversible with 
respect to a unique stationary distribution tv = [tt (i), i £J\T\, given by 

7f(i) oc 7r(i)(l - P(i,i)), i G N. 

Also, we define a function 7 : M — > E such that, for i G Af, 

4 l-P(M) = ^min{Q(i,j),Q(j,i)7T(j)/7T(i)}, (22) 
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Figure 2: An example graph Q 



and define a sequence {£m}m>o for which £ m depends solely on {X m } m >o and is geometrically 
distributed with parameter 7(X m ). It thus follows that E{£ m |X m = i} = 1/7(1), i £ N. The 
process {Xt} can now be interpreted as a semi-Markov chain with embedded Markov chain {X m } 
and respective sojourn times {£ m }. Suppose that the random walk by the MH algorithm (or the 
process {Xt}) enters node j of a graph Q, depicted in Figure at time t = 1 (Xi = j). If we 
consider a sample path (Xi, X2, ■ ■ ■ , X7) = i, k), then we have corresponding sequences 

(Xi, X2, X3, X4) = j, k) and (£1, £2, £3) = (3, 2, 1). Note that the standard definition of a semi- 
Markov process allows the sojourn time £ m to depend on both X m and X m+ i (and so we are dealing 
with a special case). From the theory of semi-Markov processes (e.g., [32]), one can easily recover 
the stationary distribution 7r as 

7r(i) oc #(i)/7(i), * G AT. (23) 

The above interpretation has been similarly given in the MCMC literature [231 HI] - In particular, 
it is known that, for any given function / : J\f— >M, 

( t \ A Sj=l ^lf(Xl) , .s 
Mm,MHUj - ^ m c (24J 

converges almost surely to E^/), as m — > 00, and thus /t m ,MH(/) is also an unbiased estimator 
for E w (/) |23j . This definition of fi m ,MH(f) enables more tractable analysis on its asymptotic 
variance, denoted as <r^ H (/), by connecting it to its counterpart in the importance sampling for 
Markov chains |23^ [TT] . Note that for sufficiently large t (also m) , the (original) unbiased estimator 
At(/) = X^s=i f(X s )/t can be written as Am,MH(/) plus some negligible term (after setting the same 
initial point Xi = Xi), because it is always possible to find m such that < ^ < Ya=i Cl- 

Also, in the limit t,m— >oo, fit(f) and Am,MH(/) are the same. We thus focus on estimators in the 
form of Am,MH(/) m our subsequent analysis. 

Consider a sequence of pairs (X m , £ m ). From the success in the NBRW-rw, one may ask what if 
the reversible embedded Markov chain {X m } m > is replaced by a related stochastic process {X' m £ 
Af}m>o, or more precisely, a non-reversible Markov chain {{X' m _ x , X' m )} m >i on the augmented 
state space Q, which avoids backtracking transitions to the extent possible, while preserving the 
same stationary distribution ir. Another question can be whether this transformation guarantees 
that the estimator in tflty based on (X' m ,^ m ) remains unbiased for E 7r (/) and also have higher 
efficiency than the original one. Our answer is in the affirmative, and this is the reasoning behind 
the improvement of our proposed MHDA over the standard MH algorithm. We stress here that, in 
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contrast to the NBRW, backtracking transitions in the process {X^} should be avoided only up to 
the extent possible^ so as to maintain the arbitrarily given original stationary distribution 7r. Thus, 
the extension from the MH algorithm to our proposed MHDA becomes necessarily more involved 
than the case of NBRW. 

Description of MHDA: Let X[ G J\f, t = 0, 1, 2, . . ., be the position of a random walk (or the 
state of a stochastic process). We also define the augmented state space Q in (fTU|) based on the 
reversible embedded Markov chain {X m }, where P(i,i) = and so en for all i. 

MHDA is described as follows. Suppose that node i is the previous node from which the walk 
came. MHDA first operates just like the MH algorithm. At the current node (state) X[ = j ^ i, 
the next node X' t+1 = k G N(j) is proposed with probability Q(j, k) (j ^ k). Then, the proposed 
transition to k is accepted with probability A(j,k) in ([7]), and rejected with probability 1 — A(j,k) 
in which case X' t+l = j. Here, in contrast to the MH algorithm, MHDA renders the accepted 
transition to X' t < 1 = k temporarily pending, and applies another procedure to proceed with the 
actual transition to A;@ 

Specifically, for the accepted transition to k, if k ^ i, then the 'actual' transition takes place, 
i.e., X' t+l = k, which happens with probability P(j,k) = Q(j,k)A(j,k) as in the MH algorithm. On 
the other hand, if k = i, then the transition to node i is delayed (thus named 'delayed acceptance'). 
The next node X' t+1 is again proposed with another proposal probability Q'(eij,ejk), which is a 
transition probability of an arbitrary Markov chain on the state space fi, where Q'(eij, eik) = for 
all j ^ I, and Q'{eij,ejk) > if and only if Q'{ekj,eji) > 0. The (second) proposed transition to 
X' t+1 =k is accepted with another acceptance probability A'(eij, ejf.), and rejected with probability 
1— A'(eij, ejk) in which case X' t+1 =i (backtracking occurs). That is, transition probability P(j, i) = 
Q(j, i)A(j, i) in the MH algorithm is leveraged to create another transition opportunity from j to 
k 7^ i in the MHDA. So, the transition from j to k ^ i occurs with larger probability P(j, k) + 
P(j,i)Q'{eij,ejk)A'(eij,ejk) than the MH algorithm (w.p. P(j,k)). This is also illustrated in 
Figure [3] where the thickness of arrows represents the corresponding transition probabilities from 
node j to other node (including self-transition). The new acceptance probability A'(eij,ejk) will 
be specified shortly. 

In summary, under MHDA, the walker stays at each node for the same random amount of time 
as it would be under the MH algorithm, while reducing the bias toward the previous node when 
making transitions to one of its neighbors. 

Analysis of MHDA: Let X' m , m>0, be the sequence of nodes visited by the walk, which moves 
over Q according to the MHDA. The process {X' m } is clearly different from the reversible, embedded 
Markov chain {X m } for the MH algorithm. Also, let £' m , m>0, be the respective sojourn time at 
node X' m . Note that the MHDA behaves differently from the MH algorithm (performs the additional 
procedure) only when a proposed transition from node j to node k ^ j (occurring with probability 
Q(j,k)) is accepted with probability A(j,k) in the MH algorithm. Thus, £f m is also geometrically 

^ If {X^} is made purely non-backtracking just like we did for NBRW, then we lose unbiasedness for the resulting 
MH-based estimator in general. 

'if the transition to k=j was accepted after a proposal with Q(j,j) >0, then the MHDA accepts the transition 
as in the MH algorithm without any further action. 
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(a) MH algorithm (b) MHDA 



Figure 3: Illustrating a difference between MH algorithm and MHDA: a walker moves from node 
j to node k with probability P(j, k) in the MH algorithm, but with larger probability P(j, k) + 
P{j,i)Q'(e ij ,e jk )A'{e ij ,e jk ) in the MHDA. 

distributed with parameter ^{X' m ). See ([22]) for That is, given that X' mi = X m2 = i, the 

sojourn times £' mi and ^ m2 have identical distributions. Therefore, the MHDA, similar to the MH 
algorithm, can be also characterized by a sequence of the pairs (X' m ,^' m ). As an example, if the 
random walk by the MHDA (or the process {X^.}) enters node i in Figure [2] at time t=l (X[ = i) 
and (X[ , X 2 , . . . , X' 7 ) = (i,i,l, j, j, j, k) , then we consequently have (X[, X 2 , X%, X'^) = (i, /, j, k) and 
(£UUs) = (2, M)- We define > fOT an y g iven / : 

y / f] A Ylk=i£lf( X l) (l? z\ 

Mm,MHDAW / — h ■ \ zc >) 

We prove below that AmMHDA(Z) converges almost surely to E 7r (/), implying that Am mhda(/) ^ s 
an unbiased estimator for E 7r (/). We also prove that, after showing the CLT holds for jx' m mhda(/)> 
the asymptotic variance of fa' m (f), denoted as o"'mhda(/)) i s smaller than its counterpart <7^ H (/) 
for the MH algorithm. 

To this end, we first explain how to properly choose the new acceptance A' '(eij,ejk) so that 
the process {AT m } has the same stationary distribution as that of the reversible embedded chain 
{X m }, while, at the same time, the process {X' m } reduces backtracking transitions. Instead of 
the process {X' m }, we deal with its related non-reversible Markov chain defined on the augmented 
state space n by consulting the general recipe for this purpose in Section 14.11 Recall the state 
space n in (jlOp obtained from the transition matrix P = {P(i, j)} of the chain {X m }. We define 
Z' m = (A^ m _ 1 ,A^ m ) e for m > 1, and P' = {P'(eij,eik)} e%J ,e lk en to be the transition matrix of 
a Markov chain {Z' m } m >i. For instance, consider a sample path (X[, X 2 , X' 3 , X' 4 ) = (i,l,j,k) in 
the above example. We have (Z 2 , Z^, Z' 4 ) = ((i,l),(J,j),(j,k)). If the chain {Z' m } has a unique 
stationary distribution tv' = [ir'(eij), eij given by 

7f / (e ii ) = 7r(i)-P(i,i), etj €Sl, (26) 

implying that Tr'{eij)=%'{eji) from the reversibility of the embedded chain {X m }, then the steady- 
state probability of the process {AT m } being at node j is the same as Tv(j) for all j. From the 
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description of MHDA, observe that, for all , ejfc £ f2 with i ^ k (d(j) > 2), 

P'{eij,e jk ) = P(j,k) + P(j,i)Q'(e ij ,e jk )A'(e ij ,e jk ), (27) 

while P'(eij,eji) = 1 — Ylk^i P'( e ij> e jfc)' as * s a ^ so shown in Figure [3jb). Note that P'(eij,ej k ) 
specifies the next node of the random walk by MHDA, given that the walk has to move from the 
current node to one of its neighbors (its sojourn time is over). Thus, P(j,k) and P(j,i) are used 
here instead of P(j,k) and P(j,i), respectively. In addition, for any j with d{j) = 1, we have 
P'(eij, eji) = P(j, i) = 1, G £, since Q'(eij, eji) = 1 (due to the stochastic matrix {Q'(eij, ei k }) 
and A'(eij,eji) = l which is shown below. 

Among many possible choices for the acceptance probability A'(eij,ej k ) in the MHDA, we have 
the following. 

Proposition 1 For any given {Q' , ei k )} , suppose that the acceptance probability A'ie-ij, e,-fc) is 
given by 

Then, the resulting transition matrix Y*' , andP satisfy conditions &16\) -(17). □ 



Proof: See Appendix A. □ 

From Theorem [3] and Proposition [H the Markov chain {Z' m } with its transition matrix P' as 
in (I27p and (|28l) . is irreducible and non-reversible with a unique stationary distribution ~k' in (I26p . 
This also implies that the process {X^} has the same stationary distribution 7r, as explained before. 
We now present our main result. 

Theorem 6 Consider a given, desired stationary distribution ir = [ir(i), i £ J\f] . Under the MHDA 
with any given {Q' '(e^ , ei k )} and its corresponding A'(eij,ejk) in $28\). for any given function f : 
N — t- 1R, asm —7- oo, fi' m M^DA^f) converges almost surely to E 7r (/), and also the asymptotic variance 
°i ' K n ,MHDAU) is n ° lar 9er than that of p, m ,MH(f), i-e., o- /2 MHDA (f)<a 2 MH (f). □ 



Proof: See Appendix B. □ 

An application of MHDA for unbiased graph sampling: We explain how MHDA can be 
applied for unbiased graph sampling applications. In particular, we present how to construct a 
(discrete-time) random walk by MHDRA, named Metropolis-Hastings Random walk with Delayed 
Acceptance (MHRW-DA), on Q that achieves the uniform stationary distribution, i.e., iv = u. The 
MHRW-DA here operates as an extension of Algorithm Q] with the following choice of {Q'(eij, e^)}: 
for all eij, ejk G f2 with i / k (d(j) > 2), 

Q'(e ij ,e jk ) = l/(d(j)-l), (29) 

implying that Q'(eij,eji) = 0. Also, Q'(eij, eji) = 1 for any j with d(J) = 1. All other elements 
are zero. While {Q'(eij, en-)} is the same as the transition matrix of NBRW, a 'Metropolizing' 
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step, which is done with A'(eij,ejk) in (|28p . must follow in order to ensure that the stationary 
distribution is uniform and the resulting estimator is unbiased. In other words, A'(eij,ejk) in (|28j> 
becomes 

A'( eij , e jk ) = min{ 1, min { ^} / min { ^ , ^ } } . 

This version of the MHDA is summarized in Algorithm [21 where X[ £ M is the location of MHRW- 
DA at time t and Yt € M indicates the previous node from which the MHRW-DA came (Yj 7^ A^). 
Here, X' Q can be chosen arbitrarily. Since there is no notion of 'previous node' Yq a t time t = 0, 
MHRW-DA initially behaves the same as MHRW until it moves from the initial position to one of 
its neighbors, and then proceeds as described in Algorithm [2] thereafter. 

Algorithm 2 MHDA for MHRW-DA (at time t) 

1: Choose node i u.a.r. from neighbors of X' t , i.e., N(X' t ) 

2: Generate p ~ U(0, 1) 

3: if p < min jl, j then 

4: if Y t = i and d(X' t ) > 1 then 

5: Choose node k u.a.r. from N{X[) \ {i} 

6: Generate q ~ U(0, 1) 

7: if g<min|l,rnin|l, (^) jmaxjl, (jggy) jjthen 

8: A 4 ' +1 <- k and Y" t+ i X t ' 

9: else 

10: X' t+1 i and y t+ i ^- A t ' 

11: end if 
12: else 

13: X' t+l <r- i and Y t+1 ^- A t ' 
14: end if 
15: else 

16: X' t+l <- A t ' and F t 
17: end if 



Theorem [6] states that the MHDA works for any given stationary distribution 7r, while allowing 
us to freely choose the new proposal probabilities {Q'faj, e/fc)} as desired. Thus, Algorithm [2] 
for MHRW-DA is nothing but a 'special case' of the MHDA. Theorem M asserts that MHRW-DA 
produces unbiased samples with higher efficiency than the corresponding MHRW (Algorithm [T]) . 
Again, we emphasize that the only additional overhead for MHRW-DA, compared to the MHRW, is 
remembering where it came from, Yt. Note that the degree of the previous node Yt is already known 
and can easily be retrieved, while the degree information of another randomly chosen neighbor is 
also necessary anyway even in the MH algorithm (to decide whether or not to move there). 
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5 Simulation Results 



In this section, we present simulation results to support our theoretical findings. To this end, we 
use the following real- world network datasets pQ: 

• AS- 733 - an undirected graph of autonomous systems (ASs) composed of 6474 nodes and 13233 
edges, where nodes represent ASs and edges exist according to AS-AS peering relationships. 

• HEP-TH - a collaboration network among authors who submit papers to High Energy Physics- 
Theory category in the e-print arXiv, forming an undirected graph with 9877 nodes and 51971 
edges, where nodes represent authors and edges exist between authors if coauthoring a paper. 

• Road- PA - a road network of Pennsylvania, forming an undirected graph with 1088092 nodes 
and 3083796 edges, where nodes represent intersections and endpoints and edges represent the 
roads connecting them. 

• Web-Google - a directed web graph with 875713 nodes and 5105039 edges, where nodes rep- 
resent web pages and directed edges represent hyperlinks between them. For our simulation, we 
use an undirected version of this web graph. 

To ensure graph connectivity, we also use the largest connected component (LCC) of each graph, 
where the LCC sizes of the AS-733, HEP-TH, Road-PA, and Web-Google graphs are 6474, 8638, 
1087562, and 855802, respectively. Here, the average degrees of AS-733, HEP-TH, Road-PA, and 
Web-Google graphs are 4.09, 5.75, 2.83, and 10.03, while their maximum degrees are 1459, 65, 9, 
and 6332, respectively. 

As a test case, we consider the estimation of the degree distribution of each graph - ¥{Dg = d} 
(pdf) and ¥{Dg > d} (ccdf), to evaluate and compare our proposed NBRW-rw and MHRW-DA 
(MHDA in Algorithm [2]) against SRW-rw and MHRW (MH algorithm in Algorithm [T]), respectively. 
As mentioned before, to estimate F{Dg = d}, we just need to choose a function f(i) = l{d(i)=d}, i £ N, 
for the corresponding estimators. Similarly, we choose f(i) = l{d(i)>d} f° r ^{Dg >d}. To measure 
the estimation accuracy, we use the following normalized root mean square error (NRMSE) [5j [30} 
[20] . y / E{(x(t) — x) 2 }/x, where x(t) is the estimated value out of t samples and x is the (ground- 
truth) real value, (x = lim^oo x(t) from unbiasedness.) In all simulations, an initial position of 
each random walk is drawn from its stationary distribution as similarly used in [5] , unless otherwise 
specified. In practical implementations, one can employ a 'burn-in' period to drive the random walk 
close to its steady-state [T2]. Each data point reported here for AS-733 and HEP-TH graphs is 
obtained from 10 independent simulations, while, for Road-PA and Web-Google graphs, the data 
points are based on 10 5 and 5 • 10 5 simulations, respectively. 

We first present the simulation results for AS-733 graph whose 'actual' degree distribution 
is almost a 'power-law' as depicted in Figure [5] (insets). Figure 2] shows that NBRW-rw (resp. 
MHRW-DA) outperforms SRW-rw (resp. MHRW) in terms of the required number of samples 
(cost) to achieve the same level of estimation error when estimating ¥{Dg = d}, as expected from 
our theoretical results. Here, the NBRW-rw (resp. MHRW-DA) brings out about 35% (resp. 14%) 
cost saving on average, when compared to the SRW-rw (resp. MHRW). In addition, we plot, in 
Figure^ the NRMSE ratio of SRW-rw (resp. MHRW) to the case of NBRW-rw (resp. MHRW-DA) 



21 



AS-733 



AS-733 




3000 3500 4000 4500 5000 5500 6000 3000 3500 4000 4500 5000 5500 6000 

# of samples # of samples 



(a) SRW-rw vs. NBRW-rw (b) MHRW vs. MHRW-DA 

Figure 4: AS-733 graph. NRMSE (averaged over all possible degrees d) of the estimator ofP{Dg = 
d}, when we vary the number of samples; the insets are for smaller number of samples. 
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Figure 5: AS-733 graph. NRMSE ratio (per degree d) when estimating F{Dg = d} with 10 4 samples; 
the insets represent the 'actual' degree distribution (ccdf) in (a) log-log scale, (b) semi-log scale. 



for every degree d when estimating ¥{Dg = d} with 10 4 samples. It clearly shows the improvement 
of our proposed methods for each degree d (all data points are above one). We also provide the 
NRMSE curve (with its ratio), in Figure [H for the comparison between NBRW-rw (resp. MHRW- 
DA) and SRW-rw (resp. MHRW) when estimating F{Dg > d} with 10 4 samples, which is again 
clearly consistent with our theoretical findings. In addition, we conduct another simulation to see 
the impact of non-stationary start for each random walk on the sampling accuracy, for which an 
initial position of each SRW and NBRW is drawn from a uniform distribution, while the initial 
position for MHRW and MHRW-DA is chosen with a probability proportional to node degree. 
Under this setting, we measure NRMSE of the estimator of P{Dg = d}, and observe that NBRW- 
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Figure 6: AS-733 graph. NRMSE (per degree d) when estimating ^?{Dg > d} with 10 4 samples; 
the insets show NRMSE ratio. 
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Figure 7: AS-733 graph. NRMSE (averaged over all possible d) of the estimator of ^{Dg 
when each random walk does not start in the stationary regime. 



d}, 



rw and MHRW-DA still outperform SRW-rw and MHRW, respectively, as shown in Figure [7l Note 
that there is not much difference between the stationary start and non-stationary start cases. (See 
Figures Hand [3 ) 

We next provide the simulation results for HEP-TH graph whose actual degree distribution is 
close to exponential as depicted in Figure [9] (insets). As before, Figure [8] demonstrates that NBRW- 
rw and MHRW-DA surpass SRW-rw and MHRW for the estimation of ^{Dg = d}, respectively. 
Specifically, the NBRW-rw (resp. MHRW-DA) saves, on average, about 22% (resp. 12%) of the 
required number of samples to attain the same level of estimation accuracy, which compared to the 
SRW-rw (resp. MHRW). Also, Figure [9] shows the NRMSE ratio of SRW-rw (resp. MHRW) to the 
case of NBRW-rw (resp. MHRW-DA) for every degree d for the estimation of ¥{Dg = d} with 10 
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Figure 8: HEP-TH graph. NRMSE (averaged over all possible d) of the estimator of ¥{Dg = d} 
with different number of samples; the insets are for smaller number of samples. 
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Figure 9: HEP-TH graph. NRMSE ratio (per degree d) for the estimation of ¥{D g = d} with 10 4 
samples; the insets represent the 'actual' degree distribution (ccdf) in (a) log-log scale, (b) semi-log 
scale. 



samples, while Figure [10] depicts the NRMSE curve (with its ratio) when estimating P{-Dg > d} 
with 10 4 samples. Both results are again in good agreement with our theoretical results. Moreover, 
after repeating the same experiment for the non-stationary start as above, we observe that the 
improvement from NBRW-rw and MHRW-DA remains preserved, as shown in Figure [TTJ 

We also present the simulation results for Road-PA graph in which every node has small degree, 
ranging from 1 to 9, and the actual degree distribution (pdf) is given in Figure [131 (inset). As seen 
from Figure [T2"l SRW-rw (resp. MHRW) requires more than twice larger samples than the case of 
NBRW-rw (resp. MHRW-DA) to attain the same level of accuracy for the estimation of ¥{Dg = d}. 
Specifically, the NBRW-rw and MHRW-DA leads to about 60% and 54% cost saving on average. 



24 



HEP-TH 



HEP-TH 




(a) SRW-rw vs. NBRW-rw 



0.8 



UJ 

w 0.6 
DC 



0.4 
0.2 





20 40 60 



o MHRW 

+ MHRW-DA 



10 20 30 40 50 60 70 
Degree 

(b) MHRW vs. MHRW-DA 



Figure 10: HEP-TH graph. NRMSE (per degree d) when estimating F{Dg>d} with 10 4 samples; 
the insets show NRMSE ratio. 
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HEP-TH graph. NRMSE (averaged over all possible d) of the estimator of ^{Dg = d}, 
random walk does not start in the steady-state. 



Also, as before, Figure [13 shows the NRMSE ratio of SRW-rw (resp. MHRW) to the case of NBRW- 
rw (resp. MHRW-DA) for every degree d when estimating V{Dg = d} with 5 • 10 5 samples, and 
Figure [14] depicts the NRMSE curve (with its ratio) for the estimation of F{Dg > d} with 5 • 10 5 
samples, which are all in good agreement with our theoretical findings. In addition, Figure [15] 
demonstrates that such considerable performance improvement of NBRW-rw (resp. MHRW-DA) 
over SRW-rw (resp. MHRW) still prevails for the case of non-stationary start. We observe that 
the NBRW-rw and MHRW-DA are remarkably effective for Road-PA graph, as the graph structure 
with small node degrees makes the less-backtracking feature more favorable. 

We finally provide the simulation results for Web-Google graph whose actual degree distribution 
is more like a power-law as shown in Figure [T71 (insets). Figure [TBI demonstrates that NBRW-rw 
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Figure 12: Road-PA graph. NRMSE (averaged over all possible d) of the estimator of ¥{Dg = d}, 
while varying the number of samples; the insets are for smaller number of samples. 
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Figure 13: Road-PA graph. NRMSE ratio (per degree d) when estimating P{Dg = d} with 5 • 10 5 
samples; the inset represents the 'actual' degree distribution (pdf). 



(resp. MHRW-DA) surpasses SRW-rw (resp. MHRW) overall for the estimation of f{Dg = d}, 
although their improvements are not as large as before. Again, Figure [T71 shows the NRMSE ratio 
of SRW-rw (resp. MHRW) to the case of NBRW-rw (resp. MHRW-DA) for every degree d in 
estimating P{Dg = d} with 5 • 10 5 samples, and Figure [TH] depicts the NRMSE curve (with its ratio) 
for the estimation of P{Dg >d} with 5-10 5 samples. Clearly, NBRW-rw performs better than SRW- 
rw for each degree d (all data points are above one), as expected from our theoretical results. We 
also observe similar results when comparing MHRW-DA and MHRW. There is, however, just one 
data point below one (in the ratio) for the estimation of ¥{Dg = d} . We admit that such an 'outlier' 
may be possible, since our theoretical results hold in the asymptotic sense. Nonetheless, MHRW- 
DA leads to an overall performance improvement (over all possible d). In addition, we observe 
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Figure 14: Road-PA graph. NRMSE (per degree d) for the estimation of F{Dg > d} with 5 • 10 5 
samples; the insets show NRMSE ratio. 
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Figure 15: Road-PA graph. NRMSE (averaged over all possible d) of the estimator of ¥{Dg = d}, 
when an initial position of each random walk is not drawn from its stationary distribution. 



that NBRW-rw (resp. MHRW-DA) remains effective in achieving higher sampling accuracy than 
SRW-rw (resp. MHRW), even when each random walk does not start in the stationary regime, as 
shown in Figure [T9l 

It is also worth noting that a direct comparison between SRW-rw (or NBRW-rw) and MH 
algorithm (or MHDA) may not be appropriate. Recently, [12J numerically shows a counter-example 
that MHRW can be more efficient, although SRW-rw has been shown to be better than the MHRW 
over several numerical simulations |29|, I12j. In addition, [6] proved, through several examples, that 
there is no clear winner between the importance sampling for reversible Markov chains (whose 
special case is the SRW-rw) and the MH algorithm. The MH algorithm also is valuable because it 
can be used to construct a reversible chain with any given stationary distribution. We thus have 
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Figure 16: Web-Google graph. NRMSE (averaged over all possible d) of the estimator of P{Dg = d} 
with different number of samples; the insets are for smaller number of samples. 
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Figure 17: Web-Google graph. NRMSE ratio (per degree d) when estimating P{Dg = d} with 5- 10 5 
samples; the insets represent the 'actual' degree distribution (ccdf) in (a) log-log scale, (b) semi-log 
scale. 



focused on improving each of the SRW-rw and the MH algorithm separately. 



6 Concluding Remarks 

We demonstrated, in theory and simulation, that our proposed NBRW-rw and MHDA guarantee 
unbiased graph sampling, while also achieving higher sampling efficiency than SRW-rw and MH 
algorithm, respectively. While the focus of this paper was on the unbiased graph sampling, we 
cannot stress enough the versatile applicability of the MHDA for any non-uniform node sampling 
(e.g., intentionally creating a known bias toward preferable nodes), not to mention its improvement 
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Figure 18: Web-Google graph. NRMSE (per degree d) when estimating F{Dg > d} with 5 • 10 5 
samples; the insets show NRMSE ratio. 



0.55 



Web-Google 



Web-Google 





(a) SRW-rw vs. NBRW-rw 



(b) MHRW vs. MHRW-DA 



Figure 19: Web-Google graph. NRMSE (averaged over all possible d) of the estimator of ¥{Dg = d}, 
when each random walk does not start in the stationary regime. 

over the famous MH algorithm in sampling efficiency. We expect that the MHDA can be applied 
to many other problems beyond the unbiased graph sampling. 



References 

[1] Stanford Large Network Dataset Collection, http://snap.stanford.edu/data/. 

[2] D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs, monograph 
in preparation. 



29 



[3] N. Alon, I. Benjamini, E. Lubetzky, and S. Sodin. Non-backtracking random walks mix faster. 
Communications in Contemporary Mathematics, 9(4):585-603, 2007. 

[4] R. B. Ash and C. A. Doleans-Dade. Probability and measure theory. Academic Press, second 
edition, 2000. 

[5] K. Avrachenkov, B. Ribeiro, and D. Towsley. Improving random walk estimation accuracy 
with uniform restarts. In WAW, 2010. 

[6] F. Bassetti and P. Diaconis. Examples comparing importance sampling and the Metropolis 
algorithm. Illinois Journal of Mathematics, 50(1):67-91, 2006. 

[7] P. Berenbrink, C. Cooper, T. R. R. Elsasser, and T. Sauerwald. Speeding up random walks 
with neighborhood exploration. In ACM SODA, 2010. 

[8] S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing markov chain on a graph. SIAM Review, 
46(4):667-689, 2004. 

[9] F. Chen, L. Lovasz, and I. Pak. Lifiting markov chains to speed up mixing. In A CM STOC, 
1999. 

[10] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a nonreversible markov chain sampler. 
Annals of Applied Probability, 10(3):726-752, 2000. 

[11] R. Douc and C. P. Robert. A vanilla Rao-Blackwellization of Metropolis-Hastings algorithms. 
Annals of Statistics, 39(l):261-277, 2011. 

[12] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Practical recommendations on 
crawling online social networks. IEEE J SAC, 2011. 

[13] S. Goel and M. J. Salganik. Respondent-driven sampling as Markov chain Monte Carlo. 
Statistics in Medicine, 28(17) :2202-2229, 2009. 

[14] P. J. Green and A. Mira. Delayed rejection in reversible jump metropolis-hastings. Biometrika, 
88(4):1035-1053, 2001. 

[15] M. A. Hasan and M. J. Zaki. Output space sampling for graph patterns. In VLDB, 2009. 

[16] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. 
Biometrika, 57(1):97-109, 1970. 

[17] S. Ikeda, I. Kubo, and M. Yamashita. The hitting and cover times of random walks on finite 
graphs using local degree information. Theoretical Computer Science, 410(1) :94-100, January 
2009. 

[18] G. L. Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299-320, 2004. 
[19] K. Jung and D. Shah. Fast gossip via nonreversible random walk. In IEEE ITW, 2006. 



30 



[20] M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou. Walking on a graph with a magnifying 
glass: stratified sampling via weighted random walks. In ACM SIGMETRICS, 2011. 

[21] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov chains and mixing times. American Math- 
ematical Society, 2009. 

[22] W. Li and H. Dai. Accelerating distributed consensus via lifting markov chains. In IEEE ISIT, 
2007. 

[23] S. Malefaki and G. Iliopoulos. On convergence of properly weighted samples to the target 
distribution. Journal of Statistical Planning and Inference, 138(4):1210-1225, 2008. 

[24] L. Massoulie, E. Le Merrer, A.-M. Kermarrec, and A. Ganesh. Peer counting and sampling in 
overlay networks: random walk methods. In PODC, 2006. 

[25] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of 
state calculations by fast computing machines. Journal of Chemical Physics, 21(6):1087-1092, 
1953. 

[26] A. Mira. Ordering and improving the performance of monte carlo markov chains. Statistical 
Science, 16(4):340-350, 2001. 

[27] R. M. Neal. Improving asymptotic variance of MCMC estimators: non-reversible chains are 
better. Technical report, No. 0406, Department of Statistics, University of Toronto, July 2004. 

[28] P. H. Peskun. Optimum monte-carlo sampling using markov chains. Biometrika, 60:607-612, 
1973. 

[29] A. H. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach. Respondent- 
driven sampling for characterizing unstructured overlays. In IEEE INFOCOM, 2009. 

[30] B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random 
walks. In IMC, 2010. 

[31] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. 
Probability Surveys, 1:20-71, 2004. 

[32] S. M. Ross. Stochastic processes. John Wiley & Son, second edition, 1996. 

[33] M. J. Salganik and D. D. Heckathorn. Sampling and estimation in hidden populations using 
respondent-driven sampling. Sociological Methodology, 34:193-239, 2004. 

[34] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger. On unbiased sampling for 
unstructured peer-to-peer networks. IEEE/ACM Transactions on Networking, 17(2):377-390, 
2009. 

[35] S. S. Wu and M. T. Wells. An extension of the metropolis algorithm. Communications in 
Statistics - Theory and Methods, 34(3):585-596, 2005. 



31 



Appendix 



A Proof of Proposition 1 

Observe that the condition in (|16|) can be written as, for all e^, Bji, ej k ,e k j 6 SI with i ^ k, 

n(j)P(j,i)P'(eij,ejk) = n(j)P(j,k)P'(e k j,eji) 
=>• n(i)P{i,j)P'(eij,e jk ) = 7t(k)P(k,j)P'(e k j,e ji ) 

7r'(eij)P\eij,ejk) = n'(e kj )P'(e kj ,eji), (30) 

which is from the reversibility of the embedded chain {X m } and n'^eij) = n(i)P(i, j), eij £ £1. 
Also, (f30j) trivially holds for i = k. Applying the form of P'(eij,ej k ) in (p7|) to the condition in 
(1301) yields 

7T (eij)P(j, k) +tt (eij)P(j, i)Q (eij,ej eij,ej k ) 

= jr'(e kj )P(j,i) + 7r'(e k j)P(j,k)Q'(e kj ,e ji )A'(e k j,e j i). (31) 

Again from the reversibility of the chain {X m } and Tx'{eij) = 7t(i)P(i, j), it is not difficult to see 
that K , (e ij )P(j,k)=K , (e kj )P(j,i), Tr'( eij )P(j,i) = n(j)P 2 (j,i), and ir'(e k j)P(j, k) = 7r(j)P 2 (j, k). 
Then, observe that (|31|) holds if and only if 

Aii \ _ P 2 (jik)Q'( e kj,ejj) .,, 

A {eij ,e jk) - p2 ^ i)Qi ^ e ^A { e kj ,e ji} 

_ P 2 U,k)Q'(e kj ,e ji ) A/f 

~ P*(j,i)Q'(e ij ,e jk ) A[ekj ' eji) 

= T(e k j , Gji)A (e k j , eji) , (32) 
where the second equality is from P(j, k) = P(j, k)/(l — P(j, j)) (j ^ k) and 

rp, \ a P 2 (j,k)Q'(e kj , eji ) 

Hence, from ([271) and (f30l) . we see that, for any given {Q'(eij, ei k )}, any acceptance probability 
A'(eij,ej k ) satisfying (I32p will make the resulting transition matrix P' (in relation to P) also 
satisfy the two conditions in ([16]) — (TT7T) . 

By nothing that (l33|) asserts T(e k j, eji) = 1/T(eij,ej k ), from (j32j) . we know that the acceptance 
probability A'(eij,ej k ) is generally in the form of F(T(eij,ej k )), where < F < 1 is any arbitrary 
function satisfying F(x) = F(l/x)/x for all x. Among infinitely many possible choices, we choose 
F(x) = min{l,x}, yielding 

A'{eij,e jk ) = mm{l,T(e kj ,eji)} , (34) 
which gives rise to f)28|) . This completes the proof. □ 
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B Proof of Theorem 6 



(I) Proof for almost sure convergence: Define a function 7' : Q,— >M. such that VCey) =7(7). 
See (|22|) for 7(-). We also define another sequence {S,m}m>i which depends on Z' m = X' m ) G SI 

and is geometrically distributed with parameter j'(Z' m ) = ^{X' m ) such that £m = £m- Now, consider 
a sequence of the pairs (Z^^m)- For any / : M— >M, choose another function g : Q— >M. such that 
9(eij) = f(j)- Then, by noting that 

fj ( f \ _ XXi£*7PQ _ YT=\ £"9(z'i) ,oz\ 



it suffices to show that, as m — > 00, 



Era hi 
1=1 «i 



E 7r (/) a.s. 



First, we define 



S m (e.ij) — l{^' =eii }, e^j G 0, 
i=i 

to indicate the number of visits to state during the first m transitions of the (non-reversible) 
Markov chain {Z'{\ over f2. Also, let Jk(eij), k > 1, be the geometrically distributed time duration 
associated with the fe*' 1 visit of the chain {Z',} to state G fi. Observe that 



(36) 



v^m e // /5a V v^^ m (ey) t( \ f \ Sm{e.jj) spS m (ejj) Ji{eij)g(e i:j ) 

£(' v^^ m ( ei i) 1 I c \ V ffm (ej j ) sr^Srn(eij) JjSfiij) 

' ^e^en 2^i=l ^l e ijj Z-,ey m 2^Z=1 SW^e^-) 

By the SLLN for i.i.d. random variables, for each £ fi, as m -> 00 and so S m (eij) —> 00, 

— — - Meij)g(eij) -> 5'(e J j)/7 / (eii) a.s., 

— — — ^ Jiidj) -> 1/V(eij) a.s. 

Since the return times of the chain {Z^} to state ej,- (the time intervals between two consecutive 
visits to eij) are i.i.d. from the strong Markov property, by applying the strong law for renewal 
processes [32J, we also have, as m — > 00, 

Sm ^ e " lj ^ ->. jr'(eij) = Tr(i)P(i,j) a.s., ej,- G 0, 
m 

where 7r' is the unique stationary distribution of the chain {Z' m }. (See Proposition [1] and (|26p .) 
Hence, from ([36]) . we have, as m — > 00, 
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Here, the RHS of (|3"7|) becomes 



E, G ^^0')/7(j) 



The first two equalities are from that P(u, v) = for all (u, u) ^ O (including P(u, u) = for all 
u), %'(eij) = Tr(i)P(i,j) (and so Ej^W^i) = ^0'))) 9^ij) = fij), and 7 / (e ij ) = 7(j), ey€fi. 
The last equality follows from 7r(j) oc ft(j)/'y[j) for all j in (I23p . Therefore, for any function /, 
Am mh(/) converges almost surely to E 7r (/), as m goes to infinity. 

(II) Proof for asymptotic variance: We next prove that the asymptotic variance of pb' m mhda(/) 
is no larger than that of ftm,w\H{f), i-e., c'mhdaC/) — <t mh(/)- Consider a sequence of the pairs 
(X m ,£ m ) by the MH algorithm. For a function / : M — >■ R, we define by T(/) the asymptotic 
variance of the following estimator 

Mfh) = ET=if(Xi)h(Xi) 

Am(l/7) E£l i/7(^0 
It then follows from a special case of Theorem 1 in [11] that, for any function / : J\f— >-R, as m — )• oo, 

v / ^-[Am,MH(/)-IE 7 r(/)]^N(0,^H(/))=N(0,r(/) + A(/)), (39) 

where 

A(/) = E 7r ( 7 )E 7r {Var{e|X}[/(X) -E»(/)] 2 7(X)}, 

and E 7r (7) = EieJV7(*) 7r (*)" Here, the expectation is with respect to X ~ 7r, and £ is geometrically 
distributed with parameter j(X). We notice that this result is obtained from the sequence of the 
pairs (X m ,^ m ) (not from the fact that {Xt} is an irreducible Markov chain) with the stationary 
distribution 7r given by cx 7r(i)/ 7 (i), i GAf. See Theorem 1 in [11] (in addition to Lemma 1 
therein) for more details. Thus, we can similarly apply the result in (|39|) for the sequence of the 
pairs (Z m ,£^J defined earlier in (I). Our proof strategy is to show that the first term T(/) for 

h(/) is no smaller than its corresponding term for <r' MHDA (/), while A(/) remains the same for 
both £>m H (/) and <t'mhda(/)- We start by showing that the latter is true. 

For a given /, again consider another function g : $7 — >W. such that g(eij) = /(j), and recall that 
l'i. e ij) = 7(i)> e ij We define by T'(f) the asymptotic variance of the following estimator 

PLUh) _ JXif&bft&D _ E^^OM^O 



A m (V7) ET=i Vim YZLxViiz'i) ' 

where the RHS of the second equality (the ratio estimator defined based on {Z!}) corresponds to 
(f38|) . In addition, if one finds a semi-Markov chain associated with the sequence (Z m ,£ m ) as was 
done for the MH algorithm, then its stationary distribution, denoted as A, should be (e.g., |32j ) 

A(ejj) cx 7f / (e i j)/7 / (eij), G fi. 
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Together with this, the facts that P(u, v) = for all (u, v) f2 (including P(u, u) = for all it) 

7'(eij) = 7(j), n'(eij)=Tt(i)P(i,j), and vr(i) oc Tr(i)/j(i) give 



Ea(V) = E T'(ey)A( ei . 

1 



Tv{i)h{i) 



^ 7W EieAr*(i)/7(j) 



(40) 



where the denominators in the RHS of the second and fourth equalities are the normalizing constants 
of A(ejj) and vr(i), respectively. Following the same lines, we similarly have E\(g) = E ir (/) for any g 
such that g(eij) = f(j). Then, if we define a random variable y~A (defined on 17) and a geometric 
random variable £" with parameter 7 / (l r ), following the similar arguments above, we obtain 

E A {Var{£"|Y}b(Y) - E x (g)] 2 7 '(Y)} 

= £ VarU ,/ |^ = e ii }[ 5 (e ii )-E A ( 5 )]V(ey)A(e ii ) 



E,wE^(W,i)/7(i) 



£ Var ^l x = J>[/0') - iM/)] 2 7(j) • ^~ 

IMVartflXMX) " Ett(/)] 2 7(^)}, 



7r(i)/7(i) 



eA/" 



(41) 



where Var{^"|y = e^} = Var{£|X = j} follows from 'y'(eij) = &%j € ^- Hence, by applying 

the result in (j39l) for the sequence (Z' m ,^) with (j35l) and ([4T)j) -(l4"T]). we have 



-EaG?) 



(/)-M/)] 

N(0,a /2 MHDA (/)), 



with a' 2 MHDA (f) = T'(f) + A(f). Thus, if T'(/) <T(/), then a /2 MHDA (/) < ^ H (/)- We below show 
that this is indeed true. 
Observe that 



Am (1/7) 



v^m 1 

^=1 7 (JQ) 



Define another function h : N 



E!=i ih&i 

such that 



E™i[/(*0-M/)]/7(^ 
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implying E$f(/i) = X^eJV = 0, which can be seen from that ir(i) oc 7r(i)/j(i), i G7V. Then, 

Theorems [U and [2] say that, as m — > oo, 



^ m 

-J2Vl(Xl) -»• E # (l/7) a.s., 



z=i 

m 



i) 



2=1 



N(0,a 2 (h)), 



and thus, by Slutsky's theorem (and almost sure convergence implies convergence in probability), 
we have, as m — > oo, 



Mfh) 

Am(l/7) 



1 



E*(l/7) 



N(0,a 2 (M). 



Since the process {-X^} has the same stationary distribution #, together with (|13p and 
following the same lines as above, we similarly have, as m — > oo, 



KrAfH) 



1 



N(0,a'\h)). 



E#(l/7) 



Hence, from Theorem [3] and Proposition [H for any function /, the asymptotic variance of p>' m {f) 
(based on {X^}) is no larger than that of ft m (f) (obtained from {X m }), i.e., cr' 2 (f) < & 2 {f), so is 
a' 2 {h) < a 2 (h). Finally, 

r'(/) = y 2 (fc)/(E«(i/7)) a < trUhy&tWi))* = r(/), 

implying that o-'mhda(Z) <°"mh(/)' and we are done - a 



"Specifically, we mean (|13[1 and (|15|l where {-X*}, tt', and 7r are replaced by {Z' m }, {X' m }, tt', and ir, 

respectively. 
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